# Load library
library(ISLR)
# Create vector of length(mpg) containing all 0s
mpg01 = rep(0, length(Auto$mpg))
# Set to 1 all elements for which mpg is greater than the median of mpg
mpg01[Auto$mpg > median(Auto$mpg)] = 1
# Adds mpg01 to data set
Auto = data.frame(Auto, mpg01)
dim(Auto)
## [1] 392 10
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
I can examine the correlation coefficients to determine which variables appear to have a strong association with mpg01. Cylinders, displacement, horsepower and weight have correlation coefficients with mpg01 of -0.7591939, -0.7534766, -0.6670526 and -0.7577566 respectively. Because these values are far from 0, each of these variables has a strong (negative) relationship with mpg01. Acceleration, year and origin have coefficients closer to 0, so I ruled them out.
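The same ranking can be read off directly by sorting the correlations with mpg01; a small sketch, assuming the Auto data frame already contains the mpg01 column created above:

```r
# Correlation of each numeric predictor with mpg01 (column 9, "name", excluded),
# sorted by absolute value so the strongest associations appear first
cors <- cor(Auto[, -9])[, "mpg01"]
sort(abs(cors), decreasing = TRUE)
```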
Here are some scatterplots of the data:
pairs(Auto)
Now I will create boxplots to show the relationship between mpg01 and cylinders, displacement, horsepower and weight.
mpg01 and cylinders:
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
mpg01 and displacement:
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
mpg01 and horsepower:
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
mpg01 and weight:
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
As shown in the boxplots, there is a clear association between each plotted variable and mpg01: the boxes do not overlap. Here is a plot of acceleration vs mpg01 for contrast.
mpg01 and acceleration:
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")
Acceleration was one of the features that I found would not be useful in predicting mpg01. The boxplot above reflects this: the distributions of acceleration are similar for the two mpg01 classes, and the boxes overlap, making it hard to distinguish an association between the features.
All in all, I conclude that there is an association between mpg01 and cylinders, displacement, horsepower and weight.
# Create train to hold a logical vector: TRUE for even years, FALSE for odd years
train = (Auto$year %% 2 == 0)
# Train data where instances have even years
Auto.train = Auto[train, ]
# Data not in train is the test data (instances with odd years)
Auto.test = Auto[!train, ]
# mpg01.test values
mpg01.test = mpg01[!train]
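A quick sanity check on the split; a sketch, assuming the train vector and the two subsets defined above (the exact counts depend on the year distribution in Auto):

```r
# Number of training (even-year) and test (odd-year) observations
c(train = sum(train), test = sum(!train))
# The two subsets should partition all rows of Auto
nrow(Auto.train) + nrow(Auto.test) == nrow(Auto)
```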
library(MASS)
# Perform LDA on training data
lda.fit = lda(mpg01 ~ cylinders + displacement + horsepower + weight, data = Auto, subset = train)
lda.fit
## Call:
## lda(mpg01 ~ cylinders + displacement + horsepower + weight, data = Auto,
## subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4571429 0.5428571
##
## Group means:
## cylinders displacement horsepower weight
## 0 6.812500 271.7396 133.14583 3604.823
## 1 4.070175 111.6623 77.92105 2314.763
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.6741402638
## displacement 0.0004481325
## horsepower 0.0059035377
## weight -0.0011465750
The output tells me that 45.7% of the training observations have mpg below its median (mpg01 = 0) and 54.3% have mpg above its median (mpg01 = 1). I also see the group means, which are the averages of each predictor within each class. The coefficients of linear discriminants provide the linear combination of cylinders, displacement, horsepower and weight that forms the LDA decision rule.
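To see how these coefficients form the decision rule, the LD1 score can be computed by hand as the predictors times lda.fit$scaling; a sketch, assuming lda.fit and Auto.test from above (centering here uses the test-set means, so the scores may differ from predict()'s internal scores by a constant shift):

```r
# Linear discriminant score: centered predictor matrix times the LD1 coefficients
X <- as.matrix(Auto.test[, c("cylinders", "displacement", "horsepower", "weight")])
ld1 <- scale(X, center = colMeans(X), scale = FALSE) %*% lda.fit$scaling
head(ld1)  # one discriminant score per test observation
```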
lda.pred = predict(lda.fit, Auto.test)
#Confusion matrix
table(lda.pred$class, mpg01.test)
## mpg01.test
## 0 1
## 0 86 9
## 1 14 73
The confusion matrix tells me that the model correctly predicted 86 + 73 = 159 observations. It was wrong for 9 + 14 = 23 observations. Hence, the error rate should be around 23 / 182 total observations = 12.64%.
Test error rate:
# Take the mean of when the prediction about mpg01 does not match mpg01.test data
mean(lda.pred$class != mpg01.test)
## [1] 0.1263736
Thus, the test error rate of the model obtained is 12.63736%.
# Perform QDA on training data
qda.fit = qda(mpg01 ~ cylinders + displacement + horsepower + weight, data = Auto, subset = train)
qda.fit
## Call:
## qda(mpg01 ~ cylinders + displacement + horsepower + weight, data = Auto,
## subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4571429 0.5428571
##
## Group means:
## cylinders displacement horsepower weight
## 0 6.812500 271.7396 133.14583 3604.823
## 1 4.070175 111.6623 77.92105 2314.763
qda.pred = predict(qda.fit, Auto.test)
#Confusion Matrix
table(qda.pred$class, mpg01.test)
## mpg01.test
## 0 1
## 0 89 13
## 1 11 69
#Error Rate
mean(qda.pred$class != mpg01.test)
## [1] 0.1318681
The confusion matrix tells me that the model correctly predicted 89 + 69 = 158 observations. It was wrong for 11 + 13 = 24 observations. The QDA predictions have an error rate of 24 / 182 = 13.18681%, which is slightly higher than the error rate for the LDA predictions.
glm.fit = glm(mpg01 ~ cylinders + displacement + horsepower + weight, data = Auto, family = binomial, subset = train)
summary(glm.fit)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight, family = binomial, data = Auto, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.48027 -0.03413 0.10583 0.29634 2.57584
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 17.658730 3.409012 5.180 2.22e-07 ***
## cylinders -1.028032 0.653607 -1.573 0.1158
## displacement 0.002462 0.015030 0.164 0.8699
## horsepower -0.050611 0.025209 -2.008 0.0447 *
## weight -0.002922 0.001137 -2.569 0.0102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 289.58 on 209 degrees of freedom
## Residual deviance: 83.24 on 205 degrees of freedom
## AIC: 93.24
##
## Number of Fisher Scoring iterations: 7
glm.probs = predict(glm.fit, Auto.test, type = "response")
glm.probs[1:10]
## 30 31 32 34 35
## 9.571817e-01 9.382003e-01 9.245438e-01 3.325005e-01 3.490227e-02
## 36 37 38 39 40
## 6.394679e-02 1.194775e-01 6.861792e-02 3.183584e-05 1.030257e-05
glm.pred = rep(0, length(glm.probs))
glm.pred[glm.probs > 0.5] = 1
#Confusion Matrix
table(glm.pred, mpg01.test)
## mpg01.test
## glm.pred 0 1
## 0 89 11
## 1 11 71
# Error Rate
mean(glm.pred != mpg01.test)
## [1] 0.1208791
The confusion matrix tells me that the model correctly predicted 89 + 71 = 160 observations. It was wrong for 11 + 11 = 22 observations, so I expect the error rate to be around 22 / 182 = 12.09%. Indeed, the test error rate of the model is 12.08791%, the smallest error rate of the three models so far.
Let’s try for K = 1.
library(class)
# Train set
train.X = cbind(Auto$cylinders, Auto$displacement, Auto$horsepower, Auto$weight)[train, ]
# Test set
test.X = cbind(Auto$cylinders, Auto$displacement, Auto$horsepower, Auto$weight)[!train, ]
train.mpg01 = mpg01[train]
set.seed(1)
knn.pred = knn(train.X, test.X, train.mpg01, k = 1)
# Confusion Matrix
table(knn.pred, mpg01.test)
## mpg01.test
## knn.pred 0 1
## 0 83 11
## 1 17 71
# Error Rate
mean(knn.pred != mpg01.test)
## [1] 0.1538462
For K=1, I got a test error rate of 15.38462%.
Let’s try for K = 3.
knn.pred = knn(train.X, test.X, train.mpg01, k = 3)
# Confusion Matrix
table(knn.pred, mpg01.test)
## mpg01.test
## knn.pred 0 1
## 0 84 9
## 1 16 73
# Error Rate
mean(knn.pred != mpg01.test)
## [1] 0.1373626
For K=3, I got a test error rate of 13.73626%, about 1.6 percentage points lower than for K=1.
Let’s try for K = 10.
knn.pred = knn(train.X, test.X, train.mpg01, k = 10)
# Confusion Matrix
table(knn.pred, mpg01.test)
## mpg01.test
## knn.pred 0 1
## 0 78 7
## 1 22 75
# Error Rate
mean(knn.pred != mpg01.test)
## [1] 0.1593407
For K=10, I got a test error rate of 15.93407%, about 2.2 percentage points higher than for K=3. Increasing K this far does not provide further improvement; it actually increased the error rate. Of these three K values (K=1, K=3, K=10), K=3 performs the best with a test error rate of 13.73626%.
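Rather than trying K values one at a time, the search can be wrapped in a loop; a sketch, assuming train.X, test.X, train.mpg01 and mpg01.test from above (the K values tried beyond 1, 3 and 10 are arbitrary):

```r
library(class)
set.seed(1)
ks <- c(1, 3, 5, 10, 20)
err <- sapply(ks, function(k) {
  pred <- knn(train.X, test.X, train.mpg01, k = k)
  mean(pred != mpg01.test)  # test error rate for this K
})
data.frame(k = ks, error = err)
```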
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore the logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
Initial set up of test and train data.
# Create vector of length(crim) containing all 0s
crim01 = rep(0, length(Boston$crim))
# Set to 1 all elements for which crim is greater than the median of crim
crim01[Boston$crim > median(Boston$crim)] = 1
# Adds crim01 to data set
Boston = data.frame(Boston, crim01)
dim(Boston)
## [1] 506 15
Explore data graphically to investigate association between crim01 and other features.
Boxplots for all but chas and crim variable:
par(mfrow=c(2,6))
boxplot(zn ~ crim01, data = Boston, main = "zn vs crim01")
boxplot(indus ~ crim01, data = Boston, main = "indus vs crim01")
boxplot(nox ~ crim01, data = Boston, main = "nox vs crim01")
boxplot(rm ~ crim01, data = Boston, main = "rm vs crim01")
boxplot(age ~ crim01, data = Boston, main = "age vs crim01")
boxplot(dis ~ crim01, data = Boston, main = "dis vs crim01")
boxplot(rad ~ crim01, data = Boston, main = "rad vs crim01")
boxplot(tax ~ crim01, data = Boston, main = "tax vs crim01")
boxplot(ptratio ~ crim01, data = Boston, main = "ptratio vs crim01")
boxplot(black ~ crim01, data = Boston, main = "black vs crim01")
boxplot(lstat ~ crim01, data = Boston, main = "lstat vs crim01")
boxplot(medv ~ crim01, data = Boston, main = "medv vs crim01")
All variables show a visible difference between the two crim01 classes except rm: its distributions differ somewhat, but most of its values lie in the same range for both classes.
Now I will consider the variable chas, the Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), by creating a bar plot.
par(mfrow=c(1,1))
aux = table(Boston$chas, Boston$crim01)
barplot(aux, beside = T, legend=rownames(aux))
The chas variable doesn’t show much difference between the two crime classes.
Thus, I conclude that the relevant variables are: zn, indus, nox, age, dis, rad, tax, ptratio, black, lstat and medv.
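The same correlation check used for the Auto data supports this choice; a sketch, assuming the Boston data frame with the crim01 column added above:

```r
# Correlation of each predictor with crim01, sorted by absolute value
# (crim itself and crim01 are dropped); chas should rank near the bottom
cors <- cor(Boston)[, "crim01"]
sort(abs(cors[!names(cors) %in% c("crim", "crim01")]), decreasing = TRUE)
```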
# create vector with numbers of first half of the observations
train = 1:(length(Boston$crim) / 2)
# create vector with numbers of second half of the observations
test = (length(Boston$crim) / 2 + 1):length(Boston$crim)
# Train data - pick out submatrix of data corresponding to first half of observations
Boston.train = Boston[train, ]
# Test data - pick out submatrix of data corresponding to second half of observations
Boston.test = Boston[test, ]
# crim01.test values
crim01.test = crim01[test]
Logistic Regression
glm.fit = glm(crim01 ~ . - crim01 - crim - chas - rm, data = Boston, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)
##
## Call:
## glm(formula = crim01 ~ . - crim01 - crim - chas - rm, family = binomial,
## data = Boston, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.81776 -0.10638 -0.00005 0.09702 2.89185
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -87.058867 16.744100 -5.199 2.00e-07 ***
## zn -0.687489 0.156682 -4.388 1.14e-05 ***
## indus 0.258780 0.138197 1.873 0.061132 .
## nox 79.036724 16.298438 4.849 1.24e-06 ***
## age 0.010631 0.017062 0.623 0.533235
## dis 3.437444 0.768647 4.472 7.75e-06 ***
## rad 2.225401 0.506385 4.395 1.11e-05 ***
## tax -0.004975 0.006387 -0.779 0.435988
## ptratio 0.899155 0.240637 3.737 0.000187 ***
## black -0.012453 0.006066 -2.053 0.040068 *
## lstat 0.238713 0.080138 2.979 0.002894 **
## medv 0.244346 0.093895 2.602 0.009259 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 329.367 on 252 degrees of freedom
## Residual deviance: 79.371 on 241 degrees of freedom
## AIC: 103.37
##
## Number of Fisher Scoring iterations: 10
glm.probs = predict(glm.fit, Boston.test, type = "response")
glm.pred = rep(0, length(glm.probs))
glm.pred[glm.probs > 0.5] = 1
#Confusion Matrix
table(glm.pred, crim01.test)
## crim01.test
## glm.pred 0 1
## 0 67 22
## 1 23 141
# Error Rate
mean(glm.pred != crim01.test)
## [1] 0.1778656
For this model of logistic regression, the confusion matrix tells me that the model predicted 67 + 141 = 208 observations correctly. It was wrong for 23 + 22 = 45 observations. It has a test error rate of 45 / 253 = 17.79 %.
LDA
# Perform LDA on training data
lda.fit = lda(crim01 ~ . - crim01 - crim - chas - rm, data =Boston, subset = train)
lda.fit
## Call:
## lda(crim01 ~ . - crim01 - crim - chas - rm, data = Boston, subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.6442688 0.3557312
##
## Group means:
## zn indus nox age dis rad tax
## 0 17.4815951 7.042454 0.4702902 54.68957 4.843857 4.239264 300.4417
## 1 0.2444444 13.959778 0.6119889 85.67889 2.895154 5.155556 358.0111
## ptratio black lstat medv
## 0 17.82086 388.3934 9.88589 25.17362
## 1 18.08778 355.2777 14.06389 22.73889
##
## Coefficients of linear discriminants:
## LD1
## zn 0.0006671759
## indus 0.0254060977
## nox 9.5487431942
## age 0.0112365193
## dis 0.0886532955
## rad 0.2209812111
## tax 0.0021114452
## ptratio 0.2580804665
## black -0.0051444312
## lstat 0.0262265345
## medv 0.0732634873
lda.pred <- predict(lda.fit, Boston.test)
#Confusion Matrix
table(lda.pred$class, crim01.test)
## crim01.test
## 0 1
## 0 80 21
## 1 10 142
# Error Rate
mean(lda.pred$class != crim01.test)
## [1] 0.1225296
For this LDA model, the confusion matrix tells me that the model predicted 80 + 142 = 222 observations correctly. It was wrong for 10 + 21 = 31 observations. It has a test error rate of 31 / 253 = 12.25%, which is over 5 percentage points lower than the error rate for the logistic regression model.
KNN
Let’s try for K = 1. Variables: zn, indus, nox, age, dis, rad, tax, ptratio, black, lstat and medv.
# Train set
train.X = cbind(Boston$zn, Boston$indus, Boston$nox, Boston$age, Boston$dis, Boston$rad, Boston$tax, Boston$ptratio, Boston$black, Boston$lstat, Boston$medv)[train, ]
# Test set
test.X = cbind(Boston$zn, Boston$indus, Boston$nox, Boston$age, Boston$dis, Boston$rad, Boston$tax, Boston$ptratio, Boston$black, Boston$lstat, Boston$medv)[test, ]
train.crim01 = crim01[train]
knn.pred = knn(train.X, test.X, train.crim01, k = 1)
# Confusion Matrix
table(knn.pred, crim01.test)
## crim01.test
## knn.pred 0 1
## 0 85 111
## 1 5 52
#Error Rate
mean(knn.pred != crim01.test)
## [1] 0.458498
For this KNN model with K = 1, the confusion matrix tells me that the model predicted 85 + 52 = 137 observations correctly. It was wrong for 5 + 111 = 116 observations. It has a test error rate of 116 / 253 = 45.85 %. This is a very poor model.
Let’s try for K = 3.
knn.pred = knn(train.X, test.X, train.crim01, k = 3)
# Confusion Matrix
table(knn.pred, crim01.test)
## crim01.test
## knn.pred 0 1
## 0 84 59
## 1 6 104
#Error Rate
mean(knn.pred != crim01.test)
## [1] 0.256917
For this KNN model with K = 3, the confusion matrix tells me that the model predicted 84 + 104 = 188 observations correctly. It was wrong for 6 + 59 = 65 observations. It has a test error rate of 65 / 253 = 25.69%. This is still a poor model, but better than when K = 1.
Let’s try for K = 10.
knn.pred = knn(train.X, test.X, train.crim01, k = 10)
# Confusion Matrix
table(knn.pred, crim01.test)
## crim01.test
## knn.pred 0 1
## 0 83 21
## 1 7 142
#Error Rate
mean(knn.pred != crim01.test)
## [1] 0.1106719
For this KNN model with K = 10, the confusion matrix tells me that the model predicted 83 + 142 = 225 observations correctly. It was wrong for 7 + 21 = 28 observations. It has a test error rate of 28 / 253 = 11.07 %. This is the best model so far.
Out of all of the KNN models, K = 10 has the lowest test error rate at 11.07 %. Compared to the logistic regression model and LDA model, KNN K = 10 has the lowest test error rate and performs the best.
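As with the Auto data, a loop over K makes the comparison easier; a sketch, assuming train.X, test.X, train.crim01 and crim01.test from above. (KNN is distance-based, so results can change if the predictors are first standardized with scale(); that variant is not explored here.)

```r
library(class)
set.seed(1)
ks <- c(1, 3, 5, 10, 20, 50)
err <- sapply(ks, function(k) {
  pred <- knn(train.X, test.X, train.crim01, k = k)
  mean(pred != crim01.test)  # test error rate for this K
})
data.frame(k = ks, error = err)
```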
The logistic function representation and the logit representation of the logistic regression model are equivalent: starting from p(x) = e^(b0 + b1x) / [1 + e^(b0 + b1x)], rearranging gives p(x)/[1 - p(x)] = e^(b0 + b1x), and taking logs yields the logit, log(p(x)/[1 - p(x)]) = b0 + b1x.
The method with the lower test error rate, meaning better predictions for observations it has never seen, should be used. I do not care whether the method has classified training observations correctly. I care if the method will be able to classify future, unanalyzed data correctly.
I realize that when K = 1 the training error rate is always 0, because each training observation's nearest neighbor is itself. I am also told that for KNN with K = 1 the average of the training and test error rates is 18%, i.e. (training error rate + test error rate) / 2 = 18%. With a training error of 0, the KNN test error rate must be 36%. The test error rate for logistic regression is 30%. I would use logistic regression to classify new observations because its test error rate is lower.
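The arithmetic above can be verified directly (a small sketch of the same calculation; the 18% and 30% figures are the ones given in the problem):

```r
train_err_knn1 <- 0     # K = 1: each training point is its own nearest neighbor
avg_err_knn1   <- 0.18  # given: average of training and test error for KNN, K = 1
test_err_knn1  <- 2 * avg_err_knn1 - train_err_knn1  # solves the average equation
test_err_knn1                        # 0.36
test_err_logistic <- 0.30            # given
test_err_knn1 > test_err_logistic    # TRUE, so prefer logistic regression
```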
I know that the quantity p(x)/[1-p(x)] is the odds. In this problem, p(x)/[1-p(x)] = 0.37. However, we want p(x) which represents the fraction of people that will default. We can solve for p(x) as follows:
p(x)/[1-p(x)] = 0.37
p(x) = 0.37 [1-p(x)]
p(x) = 0.37 - 0.37 p(x)
p(x) + 0.37 p(x) = 0.37
p(x) (1 + 0.37) = 0.37
p(x) = 0.37 / (1 + 0.37) = 0.27
Therefore, with an odds of 0.37, on average a fraction of 27% of people will default.
Here we are given p(x) = 16%. I need to plug this into the quantity for odds, p(x)/[1-p(x)], to find the odds that she will default.
0.16 / [1 - 0.16] = 0.16 / 0.84 = 0.19
Thus, her odds of defaulting are 0.19.
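Both conversions can be checked in one place; a small sketch of the two formulas above:

```r
# odds -> probability: p = odds / (1 + odds)
odds_to_p <- function(odds) odds / (1 + odds)
# probability -> odds: odds = p / (1 - p)
p_to_odds <- function(p) p / (1 - p)

odds_to_p(0.37)  # approximately 0.27
p_to_odds(0.16)  # approximately 0.19
```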